Investigating Language Impact in Bilingual Approaches for Computational Language Documentation
For endangered languages, data collection campaigns must accommodate the
challenge that many of these languages belong to oral traditions, which makes
producing transcriptions costly. It is therefore essential to translate the
recordings into a widely spoken language to ensure their interpretability. In this
paper we investigate how the choice of translation language affects the
subsequent documentation work and the automatic approaches that may operate
on top of the produced bilingual corpus. To answer this question, we use
the MaSS multilingual speech corpus (Boito et al., 2020) to create 56
bilingual pairs, which we apply to the task of low-resource unsupervised word
segmentation and alignment. Our results highlight that the choice of language
for translation influences the word segmentation performance, and that
different lexicons are learned by using different aligned translations. Lastly,
this paper proposes a hybrid approach for bilingual word segmentation,
combining boundary clues extracted from a non-parametric Bayesian model
(Goldwater et al., 2009a) with the attentional word segmentation neural model
from Godard et al. (2018). Our results suggest that incorporating these clues
into the neural models' input representation increases their translation and
alignment quality, especially for challenging language pairs.
Comment: Accepted to the 1st Joint SLTU and CCURL Workshop
A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems
Self-supervised models for speech processing emerged recently as popular
foundation blocks in speech processing pipelines. These models are pre-trained
on unlabeled audio data and then used in speech processing downstream tasks
such as automatic speech recognition (ASR) or speech translation (ST). Since
these models are now used in research and industrial systems alike, it becomes
necessary to understand the impact of factors such as the gender
distribution within the pre-training data. Using French as our investigation
language, we train and compare gender-specific wav2vec 2.0 models against
models containing different degrees of gender balance in their pre-training
data. The comparison is performed by applying these models to two
speech-to-text downstream tasks: ASR and ST. Our results show that the type of
downstream integration matters. We observe lower overall performance using
gender-specific pre-training before fine-tuning an end-to-end ASR system.
However, when self-supervised models are used as feature extractors, the
overall ASR and ST results follow more complex patterns, in which the balanced
pre-trained model is not necessarily the best option. Lastly, our crude
'fairness' metric, the relative performance difference measured between female
and male test sets, does not display a strong variation from balanced to
gender-specific pre-trained wav2vec 2.0 models.
Comment: submitted to INTERSPEECH 202
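The relative performance difference mentioned above can be computed in a few lines. The sketch below is one plausible formulation (the gap normalized by the mean score), with hypothetical WER values; the paper's exact definition may differ.

```python
def relative_performance_gap(score_female, score_male):
    """Relative difference between female and male test-set scores,
    normalized by their mean. The sign indicates which group is
    better served (for WER, lower is better, so negative means the
    female test set fared better)."""
    mean = (score_female + score_male) / 2
    return (score_female - score_male) / mean

# hypothetical WERs on female vs. male test sets for one model
gap = relative_performance_gap(12.4, 13.1)
print(f"{gap:+.3f}")   # negative: the female test set is favoured here
```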
NAVER LABS Europe's Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track
This paper presents NAVER LABS Europe's systems for Tamasheq-French and
Quechua-Spanish speech translation in the IWSLT 2023 Low-Resource track. Our
work attempts to maximize translation quality in low-resource settings using
multilingual parameter-efficient solutions that leverage strong pre-trained
models. Our primary submission for Tamasheq outperforms the previous state of
the art by 7.5 BLEU points on the IWSLT 2022 test set, and achieves 23.6 BLEU
on this year's test set, outperforming the second best participant by 7.7
points. For Quechua, we also rank first and achieve 17.7 BLEU, despite having
only two hours of translation data. Finally, we show that our proposed
multilingual architecture is also competitive for high-resource languages,
outperforming the best unconstrained submission to the IWSLT 2021 Multilingual
track, despite using much less training data and compute.
Comment: IWSLT 2023: Tamasheq-French and Quechua-Spanish challenge winner
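BLEU scores like those reported above are normally computed with tooling such as sacrebleu. As a rough illustration of what the metric measures, here is a simplified single-reference BLEU (modified n-gram precisions combined with a brevity penalty); production implementations add smoothing and other refinements this sketch omits.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of the n-grams occurring in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Simplified single-reference BLEU: geometric mean of modified
    (clipped) n-gram precisions for n = 1..max_n, times a brevity
    penalty that punishes hypotheses shorter than the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())   # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:                                # unsmoothed
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return 100 * bp * geo_mean

ref = "the cat sat on the mat"
print(round(bleu(ref, ref), 1))   # 100.0 on a perfect match
```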
ON-TRAC Consortium End-to-End Speech Translation Systems for the IWSLT 2019 Shared Task
This paper describes the ON-TRAC Consortium translation systems developed for the end-to-end model task of the IWSLT 2019 Evaluation for the English→Portuguese language pair. The ON-TRAC Consortium is composed of researchers from three French academic laboratories: LIA (Avignon Université), LIG (Université Grenoble Alpes), and LIUM (Le Mans Université). A single end-to-end model, built as a neural encoder-decoder architecture with an attention mechanism, was used for two primary submissions corresponding to the two EN-PT evaluation sets: (1) TED (MuST-C) and (2) How2. In this paper, we notably investigate the impact of pooling heterogeneous corpora for training, the impact of target tokenization (characters or BPEs), and the impact of speech input segmentation, and we compare our best end-to-end model (BLEU of 26.91 on the MuST-C and 43.82 on the How2 validation sets) to a pipeline (ASR+MT) approach.
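Since the abstract contrasts character and BPE target tokenization, the following minimal sketch shows how BPE merges are learned: the most frequent adjacent symbol pair is merged repeatedly, starting from characters. The toy corpus is invented; real systems use tools such as sentencepiece or subword-nmt rather than this didactic version.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Minimal byte-pair-encoding sketch: repeatedly merge the most
    frequent adjacent symbol pair inside words, starting from characters."""
    vocab = Counter(tuple(word) for word in corpus)   # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)              # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = Counter()
        for word, freq in vocab.items():              # apply the merge
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = ["low", "low", "lower", "lowest"]
print(learn_bpe(corpus, 2))   # [('l', 'o'), ('lo', 'w')]
```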
LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech
Self-Supervised Learning (SSL) using huge unlabeled data has been
successfully explored for image and natural language processing. Recent works
also investigated SSL from speech. They were notably successful to improve
performance on downstream tasks such as automatic speech recognition (ASR).
While these works suggest it is possible to reduce dependence on labeled data
for building efficient speech systems, their evaluation was mostly made on
ASR, under multiple heterogeneous experimental settings (most of them for
English). This hinders objective comparison of SSL approaches and the
evaluation of their impact on building speech systems. In this paper, we
propose LeBenchmark: a reproducible framework for assessing SSL from speech. It
not only includes ASR (high and low resource) tasks but also spoken language
understanding, speech translation and emotion recognition. We also focus on
speech technologies in a language different than English: French. SSL models of
different sizes are trained from carefully sourced and documented datasets.
Experiments show that SSL is beneficial for most but not all tasks which
confirms the need for exhaustive and reliable benchmarks to evaluate its real
impact. LeBenchmark is shared with the scientific community for reproducible
research in SSL from speech.
Comment: Will be presented at Interspeech 202
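The ASR tasks in benchmarks like this are typically scored with word error rate (WER). As a self-contained reference point, here is a minimal WER computed via word-level edit distance; evaluation toolkits add normalization and alignment reporting on top of this core.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance (substitutions
    + insertions + deletions) divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution / match
        prev = cur
    return prev[-1] / len(ref)

print(wer("le chat dort", "le chats dort"))   # 1 substitution over 3 words
```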
LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
Self-supervised learning (SSL) has driven unprecedented improvements in many
domains, including computer vision and natural language processing. Speech
processing has benefited dramatically from SSL, as most current
domain-related tasks are now approached with pre-trained models.
This work introduces LeBenchmark 2.0, an open-source framework for assessing and
building SSL-equipped French speech technologies. It includes documented,
large-scale corpora with up to 14,000 hours of heterogeneous
speech, ten pre-trained SSL wav2vec 2.0 models containing from 26 million to
one billion learnable parameters shared with the community, and an evaluation
protocol made of six downstream tasks to complement existing benchmarks.
LeBenchmark 2.0 also presents unique perspectives on pre-trained SSL models for
speech with the investigation of frozen versus fine-tuned downstream models,
task-agnostic versus task-specific pre-trained models as well as a discussion
on the carbon footprint of large-scale model training.
Comment: Under submission at Computer Science and Language. Preprint allowed
Findings of the IWSLT 2022 Evaluation Campaign.
The evaluation campaign of the 19th International Conference on Spoken Language Translation featured eight shared tasks: (i) Simultaneous speech translation, (ii) Offline speech translation, (iii) Speech-to-speech translation, (iv) Low-resource speech translation, (v) Multilingual speech translation, (vi) Dialect speech translation, (vii) Formality control for speech translation, and (viii) Isometric speech translation. A total of 27 teams participated in at least one of the shared tasks. This paper details, for each shared task, the purpose of the task, the data that were released, the evaluation metrics that were applied, the submissions that were received, and the results that were achieved.